A modular tool to aggregate results from bioinformatics analyses across many samples into a single report.
/cfs/klemming/projects/supr/uppstore2017170/rawData/hg19/WGS_longread/231004_PacBio_MM_celllines/analysis/sequali
General Statistics
| Sample Name | GC % | Mean length | Total reads | % est. dups. |
|---|---|---|---|---|
| pr_023_001_OPM2_hifi_reads.default | 40.46% | 12726.4bp | 6.8M | 0.02% |
| pr_023_002_KMS12BM_hifi_reads.default | 40.75% | 12006.8bp | 7.0M | 0.02% |
| pr_023_003_MM1S_hifi_reads.default | 40.41% | 12250.3bp | 6.1M | 0.03% |
Sequali
1.0.2
Sequencing quality control for both long-read and short-read data.
URL: https://github.com/rhpvorderman/sequali
DOI: 10.1093/bioadv/vbaf010
Features adapter search, overrepresented sequence analysis and duplication analysis and supports FASTQ and uBAM inputs.
Sequence Counts
Sequence counts for each sample. Duplicate read counts are an estimate.
This plot shows the total number of reads broken down into unique and duplicate reads.
The methodology to estimate duplication uses fingerprinting with subsampling based on the fingerprints themselves. This mitigates biases that might occur in estimates that only look at the first reads.
Sequali builds each fingerprint by combining an 8 bp fragment at an offset of 64 bp from the beginning of the read with an 8 bp fragment at an offset of 64 bp from the end. The offsets were chosen to limit the chance of adapter sequences contaminating the fingerprint.
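The fingerprinting idea described above can be sketched as follows. This is an illustration only, not Sequali's actual implementation: the function names, the handling of reads too short for both offsets, the reading of "offset from the end" as a fragment that ends 64 bp before the read end, and the hash-mask subsampling rule are all assumptions made for this example.

```python
import hashlib

FRAGMENT_LEN = 8  # fragment length, as stated in the text
OFFSET = 64       # offset from each read end, as stated in the text


def fingerprint(seq):
    """Combine a front and a back fragment into one fingerprint.

    Assumption: reads too short to hold both fragments return None.
    The real tool may treat short reads differently.
    """
    if len(seq) < 2 * (OFFSET + FRAGMENT_LEN):
        return None
    front = seq[OFFSET:OFFSET + FRAGMENT_LEN]
    back_end = len(seq) - OFFSET
    back = seq[back_end - FRAGMENT_LEN:back_end]
    return front + back


def keep_fingerprint(fp, bits):
    """Hypothetical subsampling rule that depends only on the fingerprint.

    A fingerprint is kept when the lowest `bits` bits of its hash are zero,
    so every copy of the same fingerprint is either always kept or always
    dropped, regardless of where the read sits in the file.
    """
    digest = hashlib.blake2b(fp.encode(), digest_size=8).digest()
    value = int.from_bytes(digest, "little")
    return value & ((1 << bits) - 1) == 0
```

Because the keep/drop decision is a function of the fingerprint alone, the sample is not tied to the first reads in the file, which is the bias the text says this approach mitigates.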
Sequence Quality Per Position
The mean quality value across each base position.
Only mean scores are plotted. The means are approximate because Sequali stores 12 Phred categories per position: 0-3, 4-7, and so on, up to 44 and higher. It does not store all 94 discrete Phred score counts for each position. For context, Illumina FASTQ files only utilize four different Phred scores.
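A minimal sketch of how a Phred score could map onto the 12 categories described above. The function and constant names are hypothetical, and the sketch only illustrates the category boundaries; it does not show how Sequali derives a mean from the category counts.

```python
NUM_CATEGORIES = 12  # 0-3, 4-7, ..., 40-43, and 44 and higher
CATEGORY_WIDTH = 4   # each category except the last spans four Phred values


def phred_category(phred):
    """Return the category index (0..11) for a Phred score."""
    return min(phred // CATEGORY_WIDTH, NUM_CATEGORIES - 1)


def category_label(index):
    """Return a readable label for a category index."""
    low = index * CATEGORY_WIDTH
    if index == NUM_CATEGORIES - 1:
        return f"{low}+"
    return f"{low}-{low + CATEGORY_WIDTH - 1}"
```

For example, `phred_category(37)` gives index 9, which `category_label` renders as `36-39`.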
As Phred scores are logarithmic, the means are computed by converting each base's score to an error probability and averaging the probabilities over the total number of bases. The mean probability is then converted back into a Phred score. Tools that average Phred scores naively are prone to overestimate the average quality by orders of magnitude. As such, Sequali might give a different plot here than other QC tools.
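The difference between the two averaging approaches can be illustrated with a short sketch. It uses the standard Phred definition, Q = -10 log10(p); the function names are hypothetical and the code works on exact scores rather than on Sequali's stored categories.

```python
import math


def mean_phred(quals):
    """Average in probability space, then convert back to Phred."""
    probs = [10 ** (-q / 10) for q in quals]
    mean_prob = sum(probs) / len(probs)
    return -10 * math.log10(mean_prob)


def naive_mean_phred(quals):
    """Arithmetic mean of the Phred scores, shown for comparison only."""
    return sum(quals) / len(quals)


quals = [40, 40, 40, 10]
print(round(mean_phred(quals), 1))  # about 16.0
print(naive_mean_phred(quals))      # 32.5
```

A single Q10 base among three Q40 bases pulls the probability-based mean down to roughly Q16, while the naive mean reports Q32.5, which is about a 45-fold difference in implied error rate.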
Per Sequence Average Quality Scores
The number of reads with average quality scores.
Shows the quality score profile at the read level. As Illumina FASTQ files only utilize four different Phred scores, the plot may look a bit erratic at times. Due to the logarithmic nature of Phred scores, lower Phred scores have a more significant impact on the average quality than higher Phred scores.
As Phred scores are logarithmic, the means are computed by converting each base's score to an error probability and averaging the probabilities over the total number of bases. The mean probability is then converted back into a Phred score. Tools that average Phred scores naively are prone to overestimate the average quality by orders of magnitude. As such, Sequali might give a different plot here than other QC tools.
Per Position GC Content
The GC content percentage at each position for each sample.
Per Sequence GC Content
The GC content distribution of the sequences for each sample.
Sequence Length Distribution
The distribution of read lengths found.
Sequence Duplication Levels
The relative level of duplication found for every sequence.
The methodology to estimate duplication uses fingerprinting with subsampling based on the fingerprints themselves. This mitigates biases that might occur in estimates that only look at the first reads.
Sequali builds each fingerprint by combining an 8 bp fragment at an offset of 64 bp from the beginning of the read with an 8 bp fragment at an offset of 64 bp from the end. The offsets were chosen to limit the chance of adapter sequences contaminating the fingerprint.
Top overrepresented sequences
The top 20 overrepresented sequences in all libraries
| Sequence | Best Match | Libraries Affected (%) |
|---|---|---|
| AAAAAAAAAAAAAAAAAAAAA | Poly-A/T repeat. Common pattern in Human Genome. | 100.00% |
| ACACACACACACACACACACA | Poly-CA/GT repeat. Common pattern in Human Genome. | 100.00% |
| CACACACACACACACACACAC | Poly-CA/GT repeat. Common pattern in Human Genome. | 100.00% |
Adapter Content
The cumulative percentage count of the found adapter sequences
Note that only samples with >= 0.1% adapter contamination are shown. There may be several adapters detected per sample. For long-read data there may be more adapters per sample; this is a result of the false positive detection rate increasing with longer read length.
Software Versions
Software Versions lists versions of software tools extracted from file contents.
| Software | Version |
|---|---|
| Sequali | 1.0.2 |